Transcoding Unicode characters with AVX-512 instructions

Authors

Abstract

Intel includes in its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512). Some of these instructions have no equivalent in earlier instruction sets. We leverage these instructions to efficiently transcode strings between the most common formats: UTF-8 and UTF-16. With our novel algorithms, we are often twice as fast as the previous best solutions. For example, we transcode Chinese text from UTF-8 to UTF-16 at more than 5 GiB/s using fewer than 2 CPU instructions per character. To ensure reproducibility, we make our software freely available as an open-source library. Our library is part of the popular Node.js JavaScript runtime.
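
As a point of reference for what transcoding between these two formats involves, the following is a minimal scalar sketch in Python. It is not the AVX-512 algorithm described in the abstract and it does not use the authors' library; it relies only on Python's built-in codecs. The function name `transcode_utf8_to_utf16le` and the sample string are assumptions introduced for illustration.

```python
def transcode_utf8_to_utf16le(data: bytes) -> bytes:
    """Hypothetical helper: decode UTF-8 bytes, then re-encode as UTF-16 (little-endian, no BOM).

    This is a plain scalar reference using Python's built-in codecs,
    not the vectorized approach the abstract refers to.
    """
    # bytes.decode raises UnicodeDecodeError on invalid UTF-8 input.
    return data.decode("utf-8").encode("utf-16-le")


# Sample Chinese text (an arbitrary example, not taken from the paper).
sample = "汉字转码"
utf8_bytes = sample.encode("utf-8")
utf16_bytes = transcode_utf8_to_utf16le(utf8_bytes)

# Each of these characters lies in the Basic Multilingual Plane at or above U+0800,
# so it takes 3 bytes in UTF-8 and 2 bytes in UTF-16.
assert len(utf8_bytes) == 3 * len(sample)
assert len(utf16_bytes) == 2 * len(sample)

# A character outside the Basic Multilingual Plane takes 4 bytes in both formats
# (a surrogate pair in UTF-16).
emoji_utf16 = transcode_utf8_to_utf16le("😀".encode("utf-8"))
assert len(emoji_utf16) == 4
```

The variable-length mapping shown here (1 to 4 bytes in UTF-8 against 2 or 4 bytes in UTF-16) is what a transcoder has to resolve for every character, which is why the per-character instruction count quoted in the abstract is a meaningful measure.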

Similar resources

ASCII Escaping of Unicode Characters

There are a number of circumstances in which an escape mechanism is needed in conjunction with a protocol to encode characters that cannot be represented or transmitted directly. With ASCII coding, the traditional escape has been either the decimal or hexadecimal numeric value of the character, written in a variety of different ways. The move to Unicode, where characters occupy two or more octe...

ALI-BABA AND THE 4.0 UNICODE CHARACTERS: New input and output concepts under Unicode

New input and output concepts under Unicode: Literary, Classical and Qurʾānic Arabic in Unicode involve more characters than the Arabic keyboard can accommodate. To enter sophisticated Arabic, Basis Technology and DecoType jointly developed an IME (input method editor): ALI (Arabic, Latin Input). When used at full power, ALI overstretches conventional font technology. Only DecoType’s Arabic Cal...

Diverse Mathematical Symbols for Arabic, Additional characters proposed to Unicode

Here are some symbols that are used in Arabic mathematical presentation [3] [4] but are not yet in the Unicode Standard [5].

Rumi Numeral System Symbols, Additional characters proposed to Unicode

A special numeral system, Rumi, was in use in North Africa from the 10th century until the 17th century. This system was used especially in the administration of the city of Fez in Morocco. It was also used in Al-Andalus, Spain, starting from the 12th century. The forms of the digits are quite different from the Arabic or the Arabic-Indic digits in use today....

Using Lexical tools to convert Unicode characters to ASCII.

Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems. It is widely used in multilingual NLP (natural language processing) projects. On the other hand, some NLP projects still deal only with ASCII characters. This paper describes methods of utilizing lexical tools to convert Unicode character...

Journal

Journal title: Software - Practice and Experience

Year: 2023

ISSN: 0038-0644, 1097-024X

DOI: https://doi.org/10.1002/spe.3261